11L INT6 XSA-all + EMA + VE — ttt_bpb 1.1487 by SoHarshh · Pull Request #1216 · openai/parameter-golf

SoHarshh · 2026-04-01T09:58:54Z

12L Banked + Parallel Muon + Value Embeddings — ttt_bpb 1.1571

Best submittable result: ttt_bpb = 1.1571, 16.47MB (seed 1)

Architecture

12L, 512d, 8H/4KV, LeakyReLU(0.5)² MLP 3×
Model banking: qo/kv/mlp_up/mlp_down as 3D tensors [num_layers, M, K]
Parallel Muon: async reduce-scatter on banked grads, no DDP
Value Embeddings: ve_dim=128, last 2 layers
EMA(0.997) with QAT-reset at quantization activation
INT4 MLP + INT4 bigram + INT6 attn + zstd
XSA last 4 layers, Partial RoPE (16/64), LN Scale 1/√(layer+1)
Legal TTT, lr=0.002, 3 epochs
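
Three of the scalar pieces in the list above are easy to state in isolation. The following is a minimal sketch in plain Python, not code from this PR: the function names are hypothetical, "LeakyReLU(0.5)²" is read as a LeakyReLU with negative slope 0.5 followed by squaring, the layer index is assumed to start at 0, and the EMA is assumed to be the usual exponential moving average of weights with decay 0.997.

```python
import math


def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    """LeakyReLU with the given negative slope, then squared (assumed reading)."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y


def ln_scale(layer_idx: int) -> float:
    """Per-layer factor 1/sqrt(layer+1); layer_idx is assumed to count from 0."""
    return 1.0 / math.sqrt(layer_idx + 1)


def ema_update(ema: list[float], weights: list[float], decay: float = 0.997) -> list[float]:
    """One EMA step: keep `decay` of the running average, blend in the rest."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


# Example values: the scale shrinks with depth, the EMA moves slowly.
scales = [ln_scale(i) for i in range(12)]
ema = ema_update([1.0, 1.0], [0.0, 2.0])
```

With decay 0.997 each step moves the average only 0.3% of the way toward the current weights, which is why the list pairs it with a reset when quantization is activated.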

Also included: 11L INT6 XSA-all experiment (quality record, unsubmittable)

ttt_bpb = 1.1487 — new quality best
19.03MB — over 16MB budget (INT6 all layers less compressible than INT4 under LZMA)
GPTQ in progress to bring size below 16MB
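
The size gap between INT6 and INT4 can be illustrated with a toy experiment. This is a sketch under stated assumptions, not the PR's pipeline: weights are synthetic Gaussian samples, quantization is plain symmetric round-to-nearest, each quantized value is stored in its own byte, and `quantize` and `compressed_size` are hypothetical helper names. Only the standard-library `lzma` and `random` modules are used.

```python
import lzma
import random


def quantize(weights, bits):
    """Symmetric round-to-nearest quantization to a signed `bits`-bit range."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]


def compressed_size(weights, bits):
    """LZMA-compressed size in bytes, storing one quantized value per byte."""
    q = quantize(weights, bits)
    raw = bytes(v & 0xFF for v in q)  # two's-complement byte per value
    return len(lzma.compress(raw))


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(50_000)]

size_int4 = compressed_size(weights, 4)
size_int6 = compressed_size(weights, 6)
print(f"INT4: {size_int4} B, INT6: {size_int6} B")
```

More quantization levels mean more entropy per weight, so the INT6 stream compresses to a larger artifact than the INT4 stream even though both occupy one byte per value before compression.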

train_gpt_v2.py:
- LZMA compression support (COMPRESS=lzma env var)
- Full Hessian GPTQ: gptq_quantize_weight() + collect_gptq_hessians() (GPTQ_ENABLED=1 activates post-training column-wise quantization)

train_gpt_v3.py (Parallel Muon):
- All replicated-param all_reduces now launched async simultaneously so NCCL can pipeline them (saves ~3-5ms/step vs serial blocking calls)
- Removed redundant .contiguous() in non-XSA attention path

run.sh:
- v11_proxy: 1-GPU smoke test for 11L INT6 stack
- v11_int6_xsaall: 11L INT6 + XSA-all + LZMA + VE (train_gpt_v2.py)
- v11_gptq: same + GPTQ_ENABLED=1 (train_gpt_v2.py)
- v11_banked: 11L INT6 + XSA-all + LZMA + VE + Parallel Muon (train_gpt_v3.py)

MatoTeziTanka · 2026-04-11T20:15:12Z

Community Review — 11L INT6 XSA-all + EMA + VE — ttt_bpb 1.1487

BPB: 1.1487 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 3da121027713, file records/track_10min_16mb/2026-03-28_12L_INT4_bQAT_VE/train_gpt.py):

The TTT path at line 1017 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
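
The score-first-per-chunk ordering described above can be sketched with a toy model. This is not the code under review: a count-based byte model stands in for the network, incrementing counts stands in for the SGD adaptation step, and `UnigramByteModel` and `score_first_ttt` are hypothetical names. What the sketch preserves is the ordering: chunk ci is scored before the model adapts on it, and the last chunk is never adapted on.

```python
import math


class UnigramByteModel:
    """Toy stand-in for the network: Laplace-smoothed byte counts."""

    def __init__(self):
        self.counts = [1] * 256
        self.total = 256

    def nll_bits(self, chunk: bytes) -> float:
        """Total negative log-likelihood of the chunk in bits (no updates)."""
        return sum(-math.log2(self.counts[b] / self.total) for b in chunk)

    def adapt(self, chunk: bytes) -> None:
        """Stand-in for the adaptation pass: update counts on the chunk."""
        for b in chunk:
            self.counts[b] += 1
            self.total += 1


def score_first_ttt(model, data: bytes, chunk_size: int):
    """Return (bits per byte, number of chunks adapted on before each score)."""
    if not data:
        raise ValueError("data must be non-empty")
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    total_bits = 0.0
    adapted_before = []
    n_adapted = 0
    for ci, chunk in enumerate(chunks):
        is_last_chunk = ci == len(chunks) - 1
        total_bits += model.nll_bits(chunk)  # score first, weights untouched
        adapted_before.append(n_adapted)
        if not is_last_chunk:                # final chunk gets no adaptation
            model.adapt(chunk)
            n_adapted += 1
    return total_bits / len(data), adapted_before


model = UnigramByteModel()
bpb, adapted_before = score_first_ttt(model, b"abababab" * 8, chunk_size=16)
```

Here `adapted_before[ci]` equals `ci`, which is the property the review checks for: every score is produced by weights that have only seen chunks 0..ci-1.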

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=10, vocab=1024, code=78732 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=10, vocab=1024, code=78732 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

SoHarshh and others added 9 commits March 28, 2026 08:04

12L INT4 bQAT + Value Embeddings — val_bpb 1.1588

0ab7caf

v7_ve seed 2: new best ttt_bpb=1.1574 (16.41MB)

0f506eb

v7_ve seed 3 (1.1580) + v9_13l code + v10_banked code (train_gpt_v3.py)

e8b8665

Non-record: 13L INT4 attn — 172ms/step, ttt_bpb=1.1640, worse than 12L VE

f25c703

12L banked + Parallel Muon + VE — ttt_bpb=1.1571 (new best)

8a626f9

Add MFU logging to training step output (zero overhead — pure Python math)

aae1ad2

11L INT6 XSA-all seed1: ttt_bpb=1.1487 (quality record, 19MB unsubmittable)

f12538f

Remove PLAN.md from tracking (private planning file)

3da1210

SoHarshh wants to merge 9 commits into openai:main from SoHarshh:submission/12L-banked-VE
